Draft ADR for modular observability pipeline #1665

Draft
cjnolan wants to merge 7 commits into main from modular-observability-adr

@cjnolan commented Mar 31, 2026

Description

Please include a summary of the changes and the related issue. List any dependencies that are required for this change.

Fixes # (issue)

Any Newly Introduced Dependencies

Please describe any newly introduced 3rd party dependencies in this change. List their name, license information and how they are used in the project.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Checklist:

  • I agree to use the APACHE-2.0 license for my code changes
  • I have not introduced any 3rd party dependency changes
  • I have performed a self-review of my code

@palade left a comment

Left a few comments

To support collection of additional HW metrics from GPU, PMU, cache utilization, etc., the current POA implementation
will be expanded to include new metrics collectors for these HW components. Also, modifications will be made to
the Edge Node Observability pipeline deployment in the orchestrator to allow it to be deployable as a standalone
pipeline without requiring other components from the EMF stack.

> without requiring other components from the EMF stack

Which components are these?

- **BIOS Metrics**: One option for these metrics is to use the [Telegraf redfish collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/redfish)
to retrieve thermal and power settings.

##### New Metrics to Configure and Enable

@palade commented Apr 10, 2026

Each of these will require additional permissions to be enabled on the device side and will likely increase resource utilization. Is there any QoS currently enabled to support existing and additional metrics collection? e.g., best-effort collection?

Instead the Orchestrator Command Line Interface (CLI) tool will be extended to provide commands for a user
to run to query the Mimir backend for metrics.

The CLI will receive a command containing the metric to be queried for, the edge node to be checked as well as

To clarify: does the query go to the orchestrator or to the edge node? Or is it forwarded to the edge node only when the requested data is not found on the orchestrator side?

to run to query the Mimir backend for metrics.

The CLI will receive a command containing the metric to be queried for, the edge node to be checked as well as
any time range required by the user. If a time range is not provided, then the CLI should use a default time range,

This concern may already be addressed, but how is clock synchronization ensured across the orchestrator and edge devices, so that the requested time range covers the same interval on every device with no offset?

The CLI will receive a command containing the metric to be queried for, the edge node to be checked as well as
any time range required by the user. If a time range is not provided, then the CLI should use a default time range,
such as the last 5 minutes. The CLI should also support retrieving both averages and sums for metrics over set time
periods.

Is the request made for a single node or for all nodes? How does this scale?

periods.

The CLI should convert the received query into the PromQL format needed for querying Mimir
and then send the PromQL query to the Mimir API. When the CLI receives the metrics back from Mimir,

@palade commented Apr 10, 2026

What happens if, for some reason, metrics collection fails or becomes unavailable during the requested time range? What if the data is only partially available?
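
For reference in this thread, the conversion step in the quoted ADR text (CLI arguments to PromQL, with a default range of the last 5 minutes and a choice of average or sum) could look roughly like the sketch below. It is written in Python for brevity and is not the Orchestrator CLI's actual implementation. The `MetricQuery` type, the `host` label name, and the example metric names are hypothetical; the `/prometheus/api/v1/query` path assumes Mimir's default Prometheus HTTP prefix.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

# Default window suggested in the ADR text: the last 5 minutes.
DEFAULT_RANGE = "5m"

# Map the user-facing aggregation name to a PromQL range-vector function.
AGGREGATIONS = {"avg": "avg_over_time", "sum": "sum_over_time"}


@dataclass
class MetricQuery:
    """Hypothetical shape of a parsed CLI request."""
    metric: str                       # metric name as stored in Mimir
    node: str                         # edge node identifier
    aggregation: str = "avg"          # "avg" or "sum"
    time_range: Optional[str] = None  # PromQL duration such as "15m" or "1h"


def to_promql(q: MetricQuery) -> str:
    """Build a PromQL expression for one metric on one edge node."""
    if q.aggregation not in AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {q.aggregation!r}")
    func = AGGREGATIONS[q.aggregation]
    window = q.time_range or DEFAULT_RANGE
    # Escape backslashes and quotes so the node name stays a valid label value.
    node = q.node.replace("\\", "\\\\").replace('"', '\\"')
    # Assumption: the node is identified by a label called "host".
    return f'{func}({q.metric}{{host="{node}"}}[{window}])'


def query_url(base_url: str, q: MetricQuery) -> str:
    """Build the instant-query URL; the HTTP request itself is left out."""
    params = urlencode({"query": to_promql(q)})
    return f"{base_url.rstrip('/')}/prometheus/api/v1/query?{params}"


if __name__ == "__main__":
    print(to_promql(MetricQuery(metric="cpu_usage_idle", node="node-1")))
    print(to_promql(MetricQuery("mem_used", "node-2", "sum", "1h")))
```

A sketch like this does not answer the question above: when samples are missing for part of the window, `avg_over_time` and `sum_over_time` aggregate only the samples that exist, so the ADR would still need to say how partial or missing data is reported to the user.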

- Provide documentation on how to install the modular observability workflow.
- Extend Orchestrator CLI documentation with new commands for metrics querying.

## Opens

@palade commented Apr 10, 2026

Has support for multi-vendor environments been considered?

## Implementation Plan

- Hardware Metrics Collection.
- Identify the new hardware metrics collectors to be added to the current edge node metrics service.

@palade commented Apr 10, 2026

What are the performance implications when additional metrics are collected, both on the orchestrator side and across the network?

pipeline and can be used to configure what metrics an edge node reports after it has been deployed without
requiring a full redeployment or access to the edge node. For modular deployments, should this also be included
and used for this purpose or should it be excluded?
- Investigate the [Intel Performance Counter Monitor (PCM)](https://github.com/intel/pcm) tool as there may be

@palade commented Apr 10, 2026

What happens if hardware metrics collection fails, for example because of a hardware malfunction, a misconfiguration, or a disruption to sensor readings?
